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c/3 ' Abstract 



Researchers in the biological sciences nowadays often encounter the curse of high- 
dimensionality, which many previously developed statistical models fail to overcome. 
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© ; regarding the response of interest is contained. Subsequent statistical analysis can 

then be made in a lower-dimensional space while preserving relevant information. 
Oftentimes studies are interested in a certain transformation of the response (the 
induced response), instead of the original one, whose corresponding CS may vary. 
When estimating the CS of the induced response, existing dimension reduction 
methods may, however, suffer the problem of inefficiency. In this article, we propose 
a more efficient two-stage estimation procedure to estimate the CS of an induced 
response. This approach is further extended to the case of censored responses. An 
application for combining multiple biomarkers is also illustrated. Simulation studies 
and two data examples provide further evidence of the usefulness of the proposed 
method. 

KEY WORDS: Asymptotic efficiency, Censoring, Central subspace, Classification, 
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1 Introduction 

Consider the problem of inferring the association between the response Y and a p- 
dimensional vector of covariates X. Most statistical methods perform well with a moder- 
ate size of p in comparison with the sample size. Unfortunately, we have trouble in dealing 
with the problem when p gets large, which is usually the case in the biological sciences 
nowadays. To improve statistical analysis, a preprocess is implemented first to reduce 
the number of covariates and then the subsequent statistical analysis is made based on 
those extracted covariates. Sufficient dimension reduction aims to reduce the number of 
covariates while preserving necessary information. Specifically, it searches for a matrix 
T g R pxd such that 

Y]LX\T T X, (1) 

where JL stands for statistical independence and d < p. An equivalent statement is that 
the conditional distribution of Y \ X and Y \ Y T X are the same. In other words, all the 
information contained in X regarding Y can be obtained through the lower-dimensional 
linear transformation T T X. Model ([1]) is very general without any extra specification for 
the conditional distribution of Y given X. It trivially holds when V is set to be the identity 
matrix and, hence, is useful only when d is adequately small. Obviously, it is span(T) that 
is of interest to us, which is called the dimension reduction subspace (Cook, 1994; Li, 1991) 
for the regression of Y on X. Under very general conditions, the intersection of all such 
dimension reduction subspaces, denoted by Sy\x, is still a dimension reduction subspace 
(Cook, 1994) and is called the central subspace (CS). We thus assume in the sequel the 
existence of Sy\x = span(r) with structural dimension dim(iSy|x) = d. There have been 
many methodologies proposed to estimate Sy\x-, beginning with the development of sliced 
inverse regression (SIR) of Li (1991), including sliced average variance estimation (SAVE) 
of Cook and Weisberg (1991), third-moment estimation of Yin and Cook (2003), inverse 
regression (IR) of Cook and Ni (2005), directional regression (DR) of Li and Wang (2007), 
discretization-expectation estimation of Zhu et al. (2010), among others. 



Oftentimes, researchers are interested in the induced response Y g = g(Y) for a known 
function g(-) instead of the original one. For example, the original response Y in the Car- 
diac Arrhythmia Study is a categorical random variable with value 1 referring to normal 
heart rhythm and values 2-16 for different types of arrhythmia. In the phase of population 
screening, however, one would merely like to distinguish patients with arrhythmia (Y > 1) 
from those without it (Y = 1). In this case, g(Y) = I(Y < 1) is of major interest, where 
/(•) is the indicator function. Taking the Angiography Cohort Study as another example, 
researchers aim to predict a patient's 10-year vital status. In this study, coronary artery 
disease (CAD)-related death time Y is the original response, and the induced response of 
interest is g(Y) = I(Y < 10). A far more complicated form of g(-) may, instead, be of 
interest, depending on the nature of the study. 

Similar to ([I]), there must exist for every g(-) a T g E R px 9 such that 

Y g ^X\T T g X, (2) 

and one has the central subspace Sy g \x = span(T 5 ) for the regression of Y g on X with the 
structural dimension dim(«Sy 9 |x) = d g . We must have Sy g \x Q <Sy\x since Y g is a function 
of Y, but a more complicated inclusion structure could exist. The following three examples 
demonstrate various relationships between Sy\x an d Sy \x with Y g = I(Y < t). 
Example 1. Assume the conditional distribution of Y given X is 

Y | X ~ Gamma(2exp(a T X), 0.5) (3) 

which satisfies ([1]) with T = a. It is easy to show that ([2]) also holds with T g = a. In this 

case, <Sy g \x = Sy\x for every t. 

Example 2. Assume the conditional distribution of Y given X is 

logF | X ~ N (-alX/alX, {a T 2 X)- 2 ) (4) 

which satisfies (JTJ with Y = [a±, a 2 ]. Provided aJX > 0, pr(F 9 = 1 | X) is a function 
of TJX, which satisfies (J2]) with Y g = ot\ + (logt)«2- In this case, Sy g \x C Sy\x and the 
direction of Sy g \x changes as t varies. 



Example 3. Let the conditional hazard function of Y given X be of the form 

\(y | X) = I(y < n) exp(af X) + /(n <y< r 2 ) exp(a^X) + I(t 2 < y) exp(ajx) (5) 

which satisfies ([I]) with Y = [«i, 02,03]. Moreover, pr(Yg = 1 | X) is a function of TjX, 
which satisfies ([2]) with T 9 = [a 1 , /(£ > Ti)a 2 , I(t > T 2 )a 3 }. In this case, Sy g \x Q <Sy\x 
and 5y \x expands up to Sy\x as t increases (i.e., the dimension also changes). 

These examples highlight the importance of <Sy g \x, because both the dimension and di- 
rection of the CS of Y g may be different from the original CS, i.e., Sy\x ma y contain redun- 
dant directions if we are interested in Y g only. If we simply treat (Y g , X) as the observed 
data, any dimension reduction method can be directly applied to estimate Sy g \x- From a 
statistical point of view, however, Y must contain more information than Y g does, therefore 
this direct method may suffer the problem of inefficiency. We use model Q to demon- 
strate the potential drawback of the direct method. Set a = (1,2, 0) T and generate X from 
iV(03, 0.8 13 + 0.2 I3IJ), where I a represents the ax a identity matrix, l a and a are ax 1 
vectors of ones and zeroes. Since Sy g \x = <Sy\x, SIR is implemented to estimate span(r 9 ) 
based on (Y,X) and (Y g ,X) separately with t = t 50 , where t a satisfies pr(Y < t a ) = a%. 
The first element of the estimates is always forced to be one since only the direction is 
relevant. Simulation results with sample size 300 and 500 replications performed give 
the means and standard errors of the estimates as (1.000, 2.030 ± 0.261, 0.003 ± 0.115) T 
under (Y g ,X), and (1.000, 1.995 ± 0.071,0.001 ± 0.030) T under (Y,X). Although both 
methods can accurately estimate the true direction (1, 2, 0) T , the standard errors for SIR 
based on (Y g , X) are larger. We detect even larger biases and errors for other choices of 
t, especially for t near the boundaries. The main theme of this paper is thus to propose 
a more efficient estimation procedure for Sy g \x based on (Y,X). 

2 A Two-Stage Estimation Procedure 

Some notation is introduced first. For a square matrix A, let Eig(A; a) be the function 
which maps A into its a leading eigenvectors. The observed data (Yj,Xj) is a random 



copy of (Y,X). Following the setting of Cook and Ni (2005), we may assume Y has a 
finite support {1, • • • , h}. In the case of a continuous response, it can be categorized 
as suggested by Li (1991). Let Z = E _1 ' 2 (X — /x) be the standardized version of X, 
where \x = E[X] and E = cov(X). Owing to 'E~ 1 ' 2 Sy\z = <Sy\x an d T^ x l 2 Sy g \z = £y 9 |x> 
there is no difference in considering the dimension reduction problem under Z-scale. In 
this section, we will consider the estimation of B and B g , the basis of Sy\z and Sy \z, 
respectively, and transform back to the original scale via T = Y,~ l ' 2 B and T g = T,~ 1 ' 2 B g . 
In practice, Z is replaced with Z = S~ 1 / 2 (X — jx) by plugging in the usual moment 
estimators jx and E. The structural dimensions d and d g are assumed to be already 
known. The selection of (d, d g ) will be discussed later. 

We start by reviewing a general estimation procedure for Sy\z- Most dimension re- 
duction methods aim to construct a symmetric kernel matrix K (if K is not symmetric, 
KK T is used instead) based on (Y, X) satisfying the property 

span(K) = S Y \ Z . (6) 

A basis of Sy\z is then given by B = Eig(J^; d). At the sampling level, B is estimated by 
B = Eig(i^; d), where K is a sample analogue of K. For example, SIR considers 

Ksir = E-^ME" 1 / 2 , M = cov(E[A | Y]) = (m - fil T h )D f (m - fil T h ) T , (7) 

where m = [mi, ■ ■ ■ ,m h ] with m 8 = E[X \ Y = i], f = (f x , ■ ■ • , fh) with fa = pr(F = 
i), and Df = diag(/). A sample analogue Ksm is obtained by plugging the moment 
estimators rh, f, jx, and E into -Ksm- It should be noted that property ((6]) does not 
hold without any cost. Depending on the choice of K, different conditions are imposed 
to ensure its validity. Inverse regression methods, such as SIR, commonly assume the 
linearity condition ((Al): E[Z | A T Z] is a linear function of Z for any matrix A), which 
is equivalent to assuming the ellipticity of X (Eaton, 1986). 

Turning to the estimation of Sy g \z for any given g(-), parallel to (jEJ), based on {Y g , X) 
we find the symmetric kernel matrix K g satisfying 

span(i^ 9 ) =S Yg \ z , (8) 



and the basis of Sy g \z which is of major interest is defined to be B g = Eig(K g ; d g ). The 
direct estimation method then substitutes an estimator K g for K g , and estimates B g by 
Eig(i^ 9 ; d g ). Similar to ((Z|), K g of SIR is given by 

K g , slR = TT^MgTT^ 2 , M g = cov(E[X | Y g \) = (m g - filJ)D fg (m g - /ilf) T , (9) 

where m g = [m gl , ■ ■ ■ ,m gs ], m gi = E[X \ Y g = i], f g = (f gX , • ■ ■ , f g8 ), f gi = pr(Y g = i), 
and s is the number of categories of Y g . Note that s < h since Y g is a function of Y. 
The sample analogue K g ,sm. can be obtained by plugging the moment estimators rh g , f g , 
ft, and S into i^ 9! sm- We have seen in the end of Section [T] that direct estimation based 
on (Y g ,X) may lose information, and we attempt to propose a more efficient estimation 
procedure. First observe that under the validity of (jSJ) and (jHJ), we must have 

K g = P B K g P B , (10) 

where P B = BB T is the orthogonal projection matrix onto span(5). Although (fTUj) is 
straightforward, it motivates us to estimate K g by P B K g P B , where P B = BB T is an 
estimate of P B . It is the projection P B that utilizes the extra information in (Y,X), and 
results in an expected gain in efficiency. Details of the procedure are listed below: 

1. Based on (Y,X), apply a dimension reduction method to obtain K and, hence, P B . 

2. Based on (Y g ,X), apply a dimension reduction method to obtain K g . 

3. Estimate B g by B g = Eig(P B K g P B ; d g ). 

With B g obtained, we then estimate a basis of Sy g \x, say T g , by r g = S _1 / 2 5 9 . The n 1 / 2 - 
consistency of T g is a direct consequence provided K and K g are also n 1//2 -consistent. We 
call the two-stage estimation procedure "A-B" hereafter, if method A is used in Step 1 and 
method B in Step 2. As SIR is the most widely applied dimension reduction method, the 
following theorem, which guarantees that SIR-SIR is more efficient than SIR, highlights 
the desirability of using our two-stage estimation procedure. We use "acov" to denote 
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the asymptotic covariance, and A > to indicate A is positive semi-definite. The proof 
is deferred to the Appendix. 

Theorem 1. Let T g be obtained from SIR-SIR, and let T g = £ _1 / 2 Eig(Ar 9i giR,; d g ) be the 
direct estimate of T g from SIR. In addition to the linearity condition (Al) above, assume 
the validity of (A2): cov(u T Z | B T Z) is non-random for any vl. Sy\z- Then, 

A = acov f n 1 ' 2 vec(r s — T g ) ) — acov [n l ' 2 vec{T g — T g ) ) > 0. 

The equality holds if and only if span(A^sm) f] span(A" S m - A 5iS ir) = {0}, where K S m 
and Kg^iR are defined in ([7]) and Qj. 

In the establishment of Theorem [TJ in addition to the linearity condition we require 
cov(v T Z | B T Z) to be non-random for any v in the complement of Sy\z- These conditions 
are not that restrictive and can be generally satisfied. As argued by Li and Wang (2007), 
(Al)-(A2) are shown to approximately hold when p is large. Moreover, (A2) is valid 
when X is normally distributed. Although normality is a stronger condition, it can be 
approximated by making a power transformation of X . One implication of Theorem [I] 
is that the total asymptotic variance of T g is strictly larger than that of T g provided 
A^O. The only possibility of no efficiency gain (i.e., A = 0) is when span(A 9 sm) an d 
span(A"sm — Kg,sm) have no common element except the zero point. This is reasonable 
since, under this situation, all the information about Sy g \z contained in Asir resides in 
A Si sir and knowing the "residual" ( Asir — A 9) sir) contributes nothing to the construction 
of Sy \z- Hence, we will gain nothing from SIR-SIR. A formal test for this condition is 
beyond the scope of this article and will be investigated in a future study. In summary, 
SIR-SIR is expected to perform well in most of the situations except the rather restrictive 
special case. This fact is also demonstrated by our simulation studies in Section 4, where 
the efficiency gain of the two-stage method is obviously detected. 

The structural dimensions d and d g should be determined before practical implemen- 
tation. To estimate d, most methods rely on a sequence of hypothesis tests (Li, 1991; 
Cook and Lee, 1999, Cook and Yin, 2001). These methods, however, may not be readily 
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applicable for the selection of d g . To simplify the estimation procedure, we alternatively 
suggest two approaches to select (d,d g ). One is to adopt the maximal eigenvalue ratio 
criterion (MERC) proposed by Luo, Wang, and Tsai (2009). Let Aj be the eigenvalue of K 
and define pi = Aj/Aj+i for 1 < i < p—1. It is proposed to select d by d = argmaxx<j< d * fa, 
where d* is a pre-specified constant. The authors suggest using d* = 5 in practice. Once 
d is obtained, we can estimate d g by a similar procedure. Let X 9ji be the eigenvalue 
of PbK 9 Pb and define p g ^ = \ g ,i/\ g ,i+i for 1 < i < d — 1. Then d g is determined by 
d g = argmax 1<i<( ^_ 1 p 9ti . As to the second method, note that the purpose of dimension re- 
duction is to improve regression or classification. Thus, it is natural to select (d, d g ) so that 
a measure of classification accuracy is maximized. In Section \5\ below, the classification 
accuracy obtained from cross-validation is used in the Cardiac Arrhythmia Study, while 
the AUC (area under the receiver operating characteristic (ROC) curve) is considered in 
the Angiography Cohort Study to select (d,d g ). 

Remark 1. In our two motivating examples, Y g = I(Y < t) is binary and, hence, due to 
its nature, SIR can capture at most one direction of Sy g \z- Alternatively, we can adopt 
SAVE in Step 2. Cook and Lee (1999) showed that for a binary response, SAVE is more 
comprehensive than SIR. The kernel matrix of SAVE is 

j^save = [£- 1/2 (//*i - no) , £~ 1/2 (£*i - s i0 )s- 1 / 2 ] (li) 

with p ti = E[X | Y g = i] and E ti = cov(X \Y g = i),i = 0, 1. Its sample analogue -R^save 
is obtained by plugging moment estimators fi t0 , p t i, S i0 , S tl , and S into ( TTTj) . 

3 Extension to Censored Response 

Dimension reduction is usually applied in the field of life science when the response of 
interest Y represents the survival time of a subject. An important issue in survival analysis 
is that the response may be censored. The exact survival time Y (and hence Y g ) may not 
always be observed and we can only observe (Y*, 5, X) instead, where Y* = min{Y, C} is 
the last observed time, S = I(Y < C) is the censoring status, and C is the censoring time. 



Motivated from two data examples in Section 1, our aim here is to modify SIR-SAVE 
to estimate Sy 3 \x with the specific choice Y g = I(Y < t) under the validity of totally 
independent censorship C JL (Y,X). The modified SIR-SIR will also be illustrated. We 
note that totally independent censorship is satisfied in the Angiography Cohort Study, 
since most of the patients are subject to Type-I censoring. 

Both SIR and SAVE in Steps 1-2 should therefore be modified. For SIR in Step 1, 
observe that S(y*,s)\z C <S(y,c)\z = <Sy\z, where the first inclusion property holds since 
(Y*,6) is a function of (Y,C), and the last equality is true by the totally independent 
censorship assumption. Thus, we suggest using the modified kernel matrix 



K* slR = S-^M^S" 1 / 2 , M* = cov(E[X | Y*,5]) = (m* - f,l T ho+hl )D r ( m * - nl T ho+hi ) 



T 



where m* = [m* 0>1) ,--- ,m* ioho) ,m* (ll) , ■ ■ ■ ,m* {lM) ] with m* (id) = E[X \ 5 = i,Y* = j], 

f* = (/(o,i)' ' ' ' ' f(PM)> All)' ' ■ ■ , f{iM)} T with f(ij) = pr(5 = ^ Y *= J)> and ^o < h and 
h\ < h denote the number of categories of Y* when 5 = and 5 = 1. Here the slice 

means, the m% -s's, are formed within those patients with 5 = and 5 = 1 separately. By 

plugging in moment estimators m*, /*, pi, and E, the sample analogue K^ m is obtained. 

This double slicing procedure was originally proposed by Li, Wang, and Chen (1999), and 

our point is to emphasize its validity under totally independent censorship. 

With regard to implementing SAVE in Step 2, we can still use the kernel matrix 

Kg,SAVE in ( ITT]) provided it can be estimated based on (Y*,6,X). First observe that 



Ju^dS xr (u,t) 



fi[*»ln = 0] = - % vy( — , ' ,< = !■ 2, (12) 

E[x „ ]Yt = 1] = J*Wzd* -°°> - W» , = ( 

1 - S X y{— oo,t) 
where a® 1 = a and a® 2 = aa T for a vector a, and Sxy(x,u) = pr(V > x, Y > y). Here 
">" is interpreted as component- wise for a vector. It implies the nu's and E^'s in (TTTj) 
are functionals of Sxy(x, y). Campbell (1981) and Burke (1988) have separately proposed 
two different estimators of Sxy(XiV)i denoted by S^y(x,y) and S XY (x,y). By plugging 
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Sxy{ x iV) into (|12l) and S XY (x, y) into ffT3|) . we can estimate /i^'s and E^'s by 

a*»- Er=i / ( r r >t) > L »-- ^ y(t) -W > 

_ 1 y. W(y? < t) ^ _ lA xf 2 ^/(r;<t) 

" tT (I - Sy(t)}Sc(Y*) ' fl ~ n ^ { 1 _ &(*)}&(!?) VHli ' 

where Sy(y) and Sc(y) are Kaplan-Meier estimators of pr(F > y) and pr(C > y). Finally, 
a modified estimator of K g> sAVE is given by 



K* 



S" 1/2 (/iu - AIo) , E" 1/a (E^ - ^o)S" 1/2 _ • 
The modified SIR-SAVE is then proposed by using K§ m and K* SAVE in Steps 1-2. 

Remark 2. For binary Y^, Cook and Lee (1999) showed that the population kernel matrix 
of SIR can be expressed as S~ 1 / 2 (/i tl — fi t0 ). The modified SIR-SIR is then proposed by 
using K* gSlR = ±~ l l 2 (fi* tl - fi* ) in Step 2. 

4 Simulation Studies 

We use models fl!])-© to evaluate the performance of our two-stage estimation procedure 
under different combinations of sample sizes (n = 50,100), number of covariates (p = 
10, 20), and censoring rates (CR = 0%, 25%). With censored data, the modified procedure 
is implemented instead. To measure the closeness of two spaces with basis A and A f , 
we adopt the Frobenius norm tr{(P/i — Pa>){Pa — Pa')} 1 ' 2 , where Pa is the orthogonal 
projection matrix onto span(A). Simulations are repeated 500 times. 

For model fll, set ot x = (3, 0.9, -1.5, 0j_ 3 ) T and a 2 = (3,4.5, 6, 0j_ 3 ) T . We inde- 
pendently generate u and r from N p (0,I p ) and Beta(1.8, 0.3), and define X = \i + 
T}' 2 ru{u T u)- xl2 with £ = 0.8 J p + 0.21 p lJ and ft = (0, 3, 0, 0j_ 3 ) T This ensures the 
ellipticity of X. For the censored case, C is generated from Gamma(2, 1.71) so that 
CR= 25%. Both SIR-SIR and SIR are implemented at t — t 30 , t 50 , and t 70 . As for the 
case of model ©, we set a x = (20, 0, 0, 0j_ 3 ) T , a 2 = (0, 15, 0, 0j_ 3 ) T , a 3 = (0, 0, 10, 0j_ 3 ) T , 
and (ti,t 2 ) = (log 2, log 8), generate X from N p (-0.2 ■ l p ,D(0.8I p + 0.2 • l p lJ")D) with 
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D = diag(2, 1, 1, lp_ 3 ), and generate C from Gamma(l,8) to produce CR= 25%. We 
implement SIR-SAVE and SAVE at t = £45, t^, and £75 so that d g = 1, 2, and 3. Various 
choices of the slicing number were examined and produced a similar result. We thus use 
h = 10 for SIR-SIR and SIR-SAVE, and (ho, h) = (5, 10) for the modified methods. 

Simulation results are provided in Table 1. Compared with the standard setting 
(n,p, CR) = (100, 10, 0%), an overall observation is that SIR-SIR and SIR-SAVE outper- 
form SIR and SAVE, even for the cases of smaller sample size (n = 50), of more "noise" 
covariates (p = 20), and of censored response (CR= 25%). The magnitude of efficiency 
gain from SIR-SIR is roughly the same for every t in model (jl]). Interestingly, the effi- 
ciency gain from SIR-SAVE in model (^ becomes greater for larger t. One reason is that 
the structural dimension of Sy g \x also increases as t does. With more directions needing 
to be estimated, more information is required to recover Sy \x, an d we gain more from the 
two-stage estimation procedure. It has been found empirically that SAVE is less efficient 
than SIR. Li and Zhu (2007) showed that SAVE will not attain n^-consistency in gen- 
eral, while SIR will, even if the number of samples in each slice is only 2. By combining 
SIR and SAVE, we expect an efficiency gain from SIR-SAVE as shown in this simulation. 

5 Data Examples 

5.1 The Angiography Cohort Study 

Detailed description of the data can be found in Lee et al. (2006). Briefly speaking, for 
each of 1050 traceable patients, four biomarkers (CRP, SAA, IL-6, and tHcy) and the 
CAD-related time of death were recorded with the aim of using the combined biomarkers 
to accurately predict a patient's t-year vital status, and thus the induced response of inter- 
est is Y g = I(Y < t). Hung and Chiang (2010) analyzed this data, combining biomarkers 
via the extended generalized linear model (EGLM): P(Y < t | X) = G(t, P?X), where /3 t 
is a p x 1 time- varying coefficient vector and G(-, •) is an unknown link function which is 
monotone increasing in its two arguments. Under EGLM, fifX is promised to be optimal 
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in distinguishing {Y < t} from {Y > £}, in the sense that the time-dependent ROC curve 
(Heagerty, Lumley, and Pepe, 2000) is the highest among all functions of X. 

The EGLM also satisfies (j2j) with Y g = I(Y < t), Sy g \x = span(T 9 ) = spa,n((3 t ), and 
d g = 1. Thus, T^X is also the optimal biomarker since any monotone transformation of 
fiJX will have the same time- dependent ROC curve. Given that a censoring mechanism 
is involved in this study, the modified SIR-SIR is applied to obtain T g in order to combine 
the biomarkers. We enter the transformed biomarker Xi/sd(Xi) to perform our analysis. 
The analysis results with d = 3 and (ho,h\) = (2,4) are found in Table 2. We remind 
the reader that the choice of these tuning parameters attains the maximum of the time- 
dependent AUC as mentioned in Section [2J The absolute coefficient of CRP is smallest 
at the beginning and increases as time goes by. SAA has a totally different behavior, 
where it has a larger effect initially but seems to be diminishing at 3500 days. Both IL-6 
and tHcy are found to play important roles in predicting patient's vital status over time. 
Interestingly, CRP has a reverse effect as compared with the other three biomarkers. 
Table 2 provides the time- dependent AUC of the composite biomarkers T^X at day t, 
denoted by At (see equation (8) of Chiang and Hung, 2010). The larger the At values, the 
higher prediction power TjA has. One can see that most of the At values are greater than 
0.7, especially at the beginning of the study. We also calculated At values, the maximal 
time-dependent AUC of the method developed in Hung and Chiang (2010), and a similar 
pattern to that of the At values was detected (note that At < A* t will always hold for 
every t). In summary, SIR-SIR is easy to implement and achieves acceptable AUC values. 

5.2 The Cardiac Arrhythmia Study 

The study consisted of 452 patients, each with 279 covariates. The response Y & {1, ■ ■ ■ , 16} 
is a categorical random variable, where 1 refers to "normal" and 2-16 refer to different 
classes of arrhythmia. See Giivenir et al. (1997) for details. 

To keep matters simple, we consider continuous predictors only and use their first 100 
principal components in our analysis. We are interested in distinguishing normal patients 
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{Y = 1} from abnormal ones {Y > 1}, i.e., Y g = I(Y < 1). The scatterplots of the 
extracted predictors (denoted by SSI,- • • ,SS5) from SIR-SAVE with (d, d g ) = (7, 5) are 
provided in Figured] Again, the selection of (d, d g ) is such that the averaged classification 
accuracy from cross-validation is maximized. It can be seen that SS1-SS3 demonstrate 
their ability to separate two groups via variation, while SS4-SS5 attempt to separate two 
groups via location. In every subplot, the normal group seems to have smaller variation 
and locates in the center of a relatively large data cloud of the abnormal group. The 
bottom-left 10 subplots of Figure [1] are scatterplots of those extracted predictors taken 
from SAVE directly. It can be seen that there is only a separation pattern of variation 
between the two groups, but no obvious location difference. To further evaluate the 
performance of those extracted predictors, we randomly separate the data into a training 
set (90%) and a test set (10%), and then implement quadratic discriminant analysis based 
on those extracted predictors. The procedure with 200 replications gives SIR-SAVE the 
averaged classification accuracy of 78%, while it is a mere 70% for SAVE. 

6 Discussion 

Although we have considered univariate responses only, there is nothing different about 
carrying out the procedure with multivariate responses, except that the kernel matrices K 
and K g are constructed for multivariate responses Y and g(Y). A multivariate response 
version of Theorem [1] can be derived with a proof analogous to the proof of the univariate 
case. We refer to Li, Wen, and Zhu (2008) for some recent developments in dimension 
reduction with multivariate responses. We note that the proposed two-stage estimation 
procedure is a general framework, and is not limited to any specific method. Depending on 
the purpose of a given study, we may adopt any dimension reduction technique in either 
Steps 1 or 2 of the procedure. Besides SIR-SIR and SIR-SAVE, we also tested various 
combinations of SIR, SAVE, IR, and DR. Simulation results (not shown here) all convey 
the same message that an efficiency gain is significantly detected, which provides evidence 
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that the superiority of the two-stage procedure comes mainly from using PbK 9 Pb, and is 
not limited to any specific choice of dimension reduction method. 
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APPENDIX 

Let Ei = cov[X \Y = i], E 9i = covLY | Y g = j], J = (J ls ■ ■ • , J h ) T with J t = I(Y = i), 
J g = (J g i, ■■ , Jgsf with J gj = I(Y g = j) , E[J] = /, and E[J g ] = / fl . There must exist 
a code matrix G = [G±, • • • , G s ] with G, G M h containing only zeros and ones such that 
J g = G T J . We may assume /i = without loss of generality and, hence, M = mDfm T and 
M g = m g Df g m?. From the definitions of i^sm an d -^ 9 ,siR; we have T = Eig(S~ 1 M; d) 
and T g = Eig(E~ l M g ;d g ). Similarly, f = Eig(S" 1 M; d), f g = Eig(E- 1 M 9 ; d g ), and 
T g = Eig(S _1 P T M g P; d g ), where P = rr T S is an estimator of P = rr T S which is the 
projection matrix onto span(r) relative to the S-inner product. 

Proof of Theorem \J\ By P T M g P = M g and delta method, it suffices to show 

$ = acov ({J*) _ acov (JJ n ) > 0, 

where U n = n l / 2 vec(t~ l P T M g P - Y>- l M g ) and U* = n 1 / 2 vec(S^ 1 M 9 - Tr x M g ). We 
first derive the weak convergence of U n . Let Hq(M,M 9 ,Ti) = T,~ 1 P T M g P. One has 
H = dvec(H (M, M g , E))/5vec([M, M g , E]) = [H u H 2 , H 3 ] by Lemma 4.1 of Tyler (1981), 
where E x = (J p ®E- 1 )(/ p 2+T PiP ){g T ®(M 9 M+)}, H 2 = P T ®(E- 1 P T ), H 3 = -{M g ll~ 1 )® 
E _1 , ® is the Kronecker product, T PtP = ^f 7 =i ^' ® -^/ ^ s ^ ne commutation matrix with 
Eij being a p x p matrix with a one in the (i,j) position and zeroes elsewhere, M + is the 
Moore-Penrose inverse of M, and Q = I p — P. From Lemma 1 below and delta method, 
U n = n 1 / 2 vec( J ff (M, M g , t)-H (M, M g , E)) converges weakly to N(0, HWH T ), where W 
is defined in Lemma 1. As to the weak convergence of U*, define Hq(M, M g , E) = E _1 M 5 
and its differential with respect to [M, M g , E] is calculated to be H = [0, H2, H 3 ] with 
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H 2 = (I p ® E^ 1 ). A similar technique gives U* = n l/2 vec(H (M , M g , E) - # (M, M fl , E)) 
which converges weakly to N(0,HWH T ). 

The difference of the asymptotic covariance matrices is ^ = HWH T — HWH T = 
Yf i=1 *i with *! = H 2 W 22 Hl - H x W lx Hl - H 2 W 22 HJ, V 2 = H 2 W 23 Hj + H 3 W 32 SJ - 
HxWmHI - H 3 W 31 H? - H 2 W 23 Hl - H 3 W 32 H^, and * 3 = -H x W 12 El - H 2 W 21 H? . It 
is shown in Lemma 2 that ty 2 = 0. Moreover, Lemma 3 implies ty 3 = 0. Hence, \1/ = \l/i 
and we are left to show ^1 > 0. By Lemma 3 and Q T m = Q T m g = 0, 

*! = {I p <g> E" 1 )^ + T P , P ){(M 5 - M g M + M g ) ® (Q T EQ)}(/ p2 + T p , p )(/ P ® E" 1 ). 

Since Q T HQ > and is not a zero matrix, it remains to show M g — M g M + M g > 0. Let 
M* = M - M g . Since Y g is a function of Y, E[X \ Y g ] = E{E[X | Y] \ Y g } and, hence, 
M* = E[cov(E[X I Y] I Y g )] > 0. It further implies M g -M g M+M g = M g (M g + M*)+M*. 
By Lemma 4 of Anderson and Duffin (1969), we have M g (M g + M*) + M* > which 
proves ^ > 0. The equality holds if and only if M g (M g + M*) + M* = 0, if and only if 
span(M s ) f] span(M*) = {0} by Lemma 3 of Anderson and Duffin (1969), if and only if 
span(i^sm) D span(if SIR - K^sir) = {0}. □ 

Lemma 1. As n goes to infinity, n 1/2 vec([M, M g , E] - [M,M g ,T,]) 4- N(0,W), where 
the asymptotic covariance matrix W = [Wij], 1 < i,j < 3, is defined in the proof. 

Proof. The limiting distributions of sample covariance matrix are the same no matter we 
know the true mean fi = or not. Thus, we consider E = n _1 Y17=i XiXj and adopt a 
similar strategy of Saracco (1997) to complete the proof. 

Let u = ( J T , (J <g> X) T , X T , Jj, (J g ® X) T , (X ® X) T ) T , E[u] = /i u , E u = cov(u), and 
u = - ^27=1 Ui w ith UiS being random copies of u. By the central limit theorem we have 
n l / 2 {u — fi u ) — y A r (0, E u ). Consider F which maps (a, (61, • • • , bh), c, d, (ei, • • • ,e s ), /) to 
vec([Eti^(^-c)(^-cr, ELi^(|-c)(|-cF,/]) for a =(a 1 ,---,a,fGM^,6 4 G 
W, ceW,d= (di, . . . , 4) T e M s , e^ e M p , and /ef 2 . By delta method, we deduce 
that n 1 / 2 vec([M, M 9 , E]-[M, M 9 , E]) = n 1 / 2 ^^)-^^)) converges weakly to N{0, W) 
with W = FH U F T , where F is the differential of F Q at \i u . A direct calculation then 
gives W u = E lC ov(J, J)E? + £ 2 diag(/f %, • ■ ■ , f^T, h )E^ W 22 = E 3 cov(J g , J g )Ej + 
J E 4 diag(/ 9 - 1 1 E 9l , • • ■ , f-^ g sM, W 33 = cov(X®X), W 12 = E 1 {cov(J, J g )(E 3 +C* mg E,) T + 

17 



f- 

Jgs 


m gs ) 


, E x 


= 
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cov(J, (Dj g l J g ) ® X)Ej} + P 2 diag(Ei, ■ • ■ , Z h ){(GDj*) ® J p }Pj, W/ 23 = P 3 cov(J s ,X ® 
X) + E 4 [{(Dj*& r D f ) <g> J p }$ - C^E^P^ 1 J 9 ) (8) (X ® X) T }], and W 13 = E lC ov(J, X ® 
X) + P 2 ($ - CmEHD/j) (gi (X ® X) T ]), where $ = EfpJ 1 J) <g> {X(X ® X) T }], C m = 
diag(mi,--- ,m h ), C mg = diag(m 9 i,--- ,m gs ), C* mg = diag(/ g " 1 1 m sl , ■ • • , 
[mi® mi, • • • ,m h <8>m h ], E 2 = (I p 2+T PiP ){(mDf)®I p }, E 3 = [m gl ®m gl , 
and E 4 = (J p2 + T PtP ){(m g D fg ) <g> I p }. 

Lemma 2. Under (A1)-(A2), ^ 2 = 0. 

Proof. From Q T m = Q T m g = and Q T EP = 0, we have \& 2 = ^20 + ^20 witn ^20 = 
{I p ® E- 1 )^ + ^{(m^D, - M g M + mD f ) <8> I p }{(I h ® Q T )$(P g> / P )}P 3 T , and it 
suffices to show \P 20 = 0. From span(r) = Sy\x and (Al), we have Q T E[X(X T <S> X T ) \ 
Y = i]{P®I p ) = E{(X T P)®cov(Q T X I T T X) \Y = i} = (mfP)<g>(Q T EQ) by Lemma 4. 
It further implies (I h <8> Q T )$(P <g> I p ) = (m T P) <g> (Q T EQ). Substituting this into ^ 20 
and using (m g G T D f - M g M + mD f )m T = M g - M g = to conclude ^ 20 = 0. D 

Lemma 3. Under (A1)-(A2), Q T E;Q = Q T T, gj Q = Q T EQ and Q T E;P = Q T Y> 9] P = 0. 

Proof. Note that E, = E[cov(Q T X | T T X) \ Y = i] + cov(P T X \ Y = i) by (Al) and 
span(r) = Sy\x- The result is proved by Lemma 4. The case of E gj is similar. □ 

Lemma 4. Under (A1)-(A2), cov(Q T X | T T X) = Q T EQ. 

Proof. From (Al), cov(Q T X | T r X) = £(X)Q T Y,Q for some positive function £(■). Also, 
E = E[cov(X I T T X)] + cov(E[X | T T X]) implies Q T EQ = E[cov(Q T X | T T X)]. These 
two facts gives E[£(X)] = 1. Note that Q T EP = implies span(E 1/2 Q) 1 span(E 1/2 P) = 
S Y \z and, hence, cov(Q T X | T T X) = cov(Q T Y}l 2 Z | B T Z) is non-random by (A2). 
Hence, we must have £(•) = 1 which completes the proof. □ 
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Table 1 
Averages of Frobenius norms under different t and (re,p, CR) for models ([4|)-(j5]) 



Model- 


(4) 




(100, 10, 0%) 


(100, 20, 0%) 


(100, 10, 25%) 


(50, 10, 0%) 


£30 




SIR-SIR 


0.241 


0.320 


0.343 


0.326 






SIR 


0.358 


0.558 


0.451 


0.515 


£50 




SIR-SIR 


0.181 


0.278 


0.317 


0.265 






SIR 


0.309 


0.490 


0.408 


0.455 


£70 




SIR-SIR 


0.239 


0.323 


0.357 


0.333 






SIR 


0.363 


0.558 


0.469 


0.521 


Model- 


(5) 




(100, 10, 0%) 


(100, 20, 0%) 


(100, 10, 25%) 


(50, 10, 0%) 


£■15 




SIR-SAVE 


0.572 


0.805 


0.581 


0.815 






SAVE 


0.676 


1.042 


0.697 


1.002 


*65 




SIR-SAVE 


1.022 


1.449 


1.101 


1.391 






SAVE 


1.354 


1.705 


1.415 


1.572 


£75 




SIR-SAVE 


1.129 


1.600 


1.365 


1.538 






SAVE 


1.775 


2.176 


1.844 


1.952 
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Table 2 

r„ and the time- dependent AUC values At and A% at different time points t 



t CRP SAA IL-6 tHcy A t A* t 

1000 -0.400 0.580 0.465 0.643 0.748 0.760 

1500 -0.532 0.560 0.605 0.573 0.735 0.744 

2000 -0.495 0.579 0.573 0.578 0.733 0.745 

2500 -0.619 0.529 0.690 0.531 0.693 0.708 

3000 -0.695 0.488 0.759 0.499 0.709 0.724 

3500 -0.735 0.165 0.652 0.705 0.670 0.675 
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Figure 1: The scatter plot matrix of extracted predictors from SIR-SAVE (upper trian- 
gular panel) and SAVE (lower triangular panel) with (d,d g ) = (7,5). The green pluses 
and black dots indicate the normal and abnormal patients. 
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